Loop through xmls and check if contents are in a csv - python-3.x

I have a .csv containing a list of names, and I'm trying to check whether those names are contained within a bunch of .xml files in a directory. I've tried my best to make the code open each .xml and check whether the name inside it matches one in my list CSV.
My CSV has no headers; it is just a single column of names. Examples of the names:
epsilon-prod-tps
display-eng-sl
alantest-prod-ab
So I need the code to open an .xml, check whether the name listed inside it is in my CSV, close the .xml, and move on to the next one, recording any that don't match, of course. The XML part works, and so does the check-if-in-CSV part; I'm just struggling to combine the two so they work together, if that makes sense.
My code is as follows:
import os
from xml.etree import cElementTree as ET

InputPath = open('//auditdrive.local/audittest/Oisin/py/Auditor/List.csv', 'r')
string_append = ''
file_path = r'\\prod.mfg\xmlfolder'
directory = os.listdir(file_path)

for fname in directory:
    if os.path.isfile(file_path + os.sep + fname + os.sep + fname + '.xml.'):
        with open(file_path + os.sep + fname + os.sep + fname + '.xml.', 'r') as xml:
            product_count += 1
            print(file_path + os.sep + fname + os.sep + fname + '.xml.')
            tree = ET.parse(xml)
            root = tree.getroot()
            for recipe in root.findall('RecName'):
                rec_name = root.find('RecName').text
                print(rec_name, 'extracted from - ', fname + '.xml')
                goodlist = InputPath.read()
                print(' Checking for match in /List.csv')
                for i in goodlist:
                    if rec_name in goodlist:
                        print(' PASS')
                    else:
                        print(' FAIL ...')
                        print('Invalid name in %s' % fname)
                        string_append = string_append + file_path + os.sep + fname + os.sep + fname + '.xml.' + ' ,'
            xml.close()
Currently it just prints the following:
PASS
PASS
PASS
PASS
PASS
PASS
PASS
PASS
\\prod.mfg\xmlfolder\projecta\projecta.xml.
epsilon-prod-tps extracted from - projecta.xml
Checking for match in /List.csv
\\prod.mfg\xmlfolder\projectb\projectb.xml.
display-eng-sl extracted from - projectb.xml
Checking for match in /List.csv
\\prod.mfg\xmlfolder\projectc\projectc.xml.
alantest-prod-ab extracted from - projectc.xml
Checking for match in /List.csv
It seems to loop through my CSV and print a PASS for each of the rows in my CSV (approx. 200), then it does the other for loop and doesn't check whether they're in the CSV at all.
This is my first Python project; sorry for any errors or mistakes in my question.
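One way to combine the two steps, sketched here as a suggestion rather than taken from an accepted answer, is to read List.csv into a set once before the directory loop and then test each extracted RecName against that set. The paths, the RecName element, and the trailing '.xml.' extension are copied from the question; a one-name-per-row CSV layout is assumed.

import os
import csv
from xml.etree import ElementTree as ET

# Read the CSV once into a set (assumes one name per row, no header).
with open('//auditdrive.local/audittest/Oisin/py/Auditor/List.csv', newline='') as f:
    good_names = {row[0].strip() for row in csv.reader(f) if row}

file_path = r'\\prod.mfg\xmlfolder'
failures = []

for fname in os.listdir(file_path):
    xml_path = os.path.join(file_path, fname, fname + '.xml.')  # extension as shown in the question
    if not os.path.isfile(xml_path):
        continue
    root = ET.parse(xml_path).getroot()
    rec = root.find('RecName')  # assumes RecName is a direct child of the root
    rec_name = rec.text.strip() if rec is not None and rec.text else None
    if rec_name in good_names:
        print(' PASS -', rec_name)
    else:
        print(' FAIL -', rec_name, 'extracted from', xml_path)
        failures.append(xml_path)

print('Files with unmatched names:', failures)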

Related

using python to parse through files for data

I have two files: one template file and one file which has the values for the template file. I am trying to take the template file, pass values to its variables from the other file, and combine the two into a third file. I am able to copy one file to another using the following snippet of code:
print("Enter the Name of Source File: ")
sFile = input()
print("Enter the Name of Target File: ")
tFile = input()
fileHandle = open(sFile, "r")
texts = fileHandle.readlines()
fileHandle.close()
fileHandle = open(tFile, "w")
for s in texts:
    fileHandle.write(s)
fileHandle.close()
print("\nFile Copied Successfully!")
However, I am not sure how to do it for two or more files and then make them into one file. Any help/guidance is appreciated.
This is certainly not the most elegant solution but I think it should work for you.
# You could add as many files to this list as you want.
list_of_files = []
count = 1
while True:
    print(f"Enter the Name of Source File{count} (Enter blank when done adding files): ")
    sFile = input()
    # If the input is not empty then add the filename to list_of_files.
    if sFile:
        list_of_files.append(sFile)
        count += 1
    else:
        break

print("Enter the Name of Target File: ")
tFile = input()

# "with open" will open the file and then close it when done.
with open(tFile, 'a+') as target:
    # This will loop over all the files in your list.
    for file in list_of_files:
        tmp = open(file, 'r')
        target.write('\n' + tmp.read())
        tmp.close()
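The answer above concatenates the input files; for the template-plus-values part of the question, the layout of the values file is not shown, so the following is a rough sketch only. It assumes one key=value pair per line in the values file and $name-style placeholders in the template, and uses string.Template to write the combined result to a third file (template.txt, values.txt, and combined.txt are placeholder names).

from string import Template

template_file = "template.txt"   # hypothetical name; e.g. contains "Hello $name, id $id"
values_file = "values.txt"       # hypothetical name; assumed "key=value" per line
output_file = "combined.txt"     # hypothetical name for the third file

# Read the key=value pairs into a dict (assumption about the values file format).
values = {}
with open(values_file) as vf:
    for line in vf:
        if "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()

# Substitute the values into the template and write the combined result.
with open(template_file) as tf, open(output_file, "w") as out:
    out.write(Template(tf.read()).safe_substitute(values))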

Python 3.7: Batch renaming numbered files in a directory while preserving their sequence

I'm relatively new to Python, and have only recently started trying to use it for data analysis. I have a list of image files in a directory that have been acquired in sequence, and they have been named like so:
IMG_E5.1.tif
IMG_E5.2.tif
IMG_E5.3.tif
...
...
IMG_E5.107.tif
I would like to replace the dot and the number following it with an underscore and a four-digit integer, while preserving the initial numbering of the file, like so:
IMG_E5_0001.tif
IMG_E5_0002.tif
IMG_E5_0003.tif
...
...
IMG_E5_0107.tif
Could you advise me on how this can be done, or if there is already an answer that I'm not aware of, link me to it? Many thanks!
I managed to find a method that works for this
import os
import os.path as path
from glob import glob

# Get current working directory
file_path = os.getcwd()
file_list = []

for i in range(1, 500):
    # Generate file name (with wildcards) to search for
    file_name = path.abspath(file_path + "/IMG*" + "." + str(i) + ".tif")
    # Search for files
    file = glob(file_name)
    # If found, append to list
    if len(file) > 1:
        file_list.append(file[0])
    elif len(file) == 1:
        file_list.append(file[0])

for file in file_list:
    # Use the "split" function to split the string at the periods
    file_name, file_num, file_ext = file.split(".")
    file_new = path.abspath(file_name + "_"
                            + str(file_num).zfill(4)
                            + "." + file_ext)
    os.rename(file, file_new)
I am still relatively inexperienced with coding, so if there is a more straightforward and efficient way to tackle this problem, do let me know. Thanks.
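A more compact sketch, assuming every target file matches the IMG_E5.<number>.tif pattern shown above, is to capture the trailing number with a regular expression and rename through pathlib:

import re
from pathlib import Path

# Match names like "IMG_E5.107.tif": capture the stem and the sequence number.
pattern = re.compile(r"^(?P<stem>.+)\.(?P<num>\d+)\.tif$")

for path in Path.cwd().glob("IMG*.tif"):
    match = pattern.match(path.name)
    if match:
        new_name = f"{match.group('stem')}_{int(match.group('num')):04d}.tif"
        path.rename(path.with_name(new_name))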

Convert multiple .txt files into single .csv file (python)

I need to convert a folder with around 4,000 .txt files into a single .csv with two columns:
(1) Column 1: 'File Name' (as specified in the original folder);
(2) Column 2: 'Content' (which should contain all text present in the corresponding .txt file).
Here you can see some of the files I am working with.
The most similar question to mine here is this one (Combine a folder of text files into a CSV with each content in a cell) but I could not implement any of the solutions presented there.
The last one I tried was the Python code proposed in the aforementioned question by Nathaniel Verhaaren but I got the exact same error as the question's author (even after implementing some suggestions):
import os
import csv

dirpath = 'path_of_directory'
output = 'output_file.csv'

with open(output, 'w') as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(['FileName', 'Content'])
    files = os.listdir(dirpath)
    for filename in files:
        with open(dirpath + '/' + filename) as afile:
            csvout.writerow([filename, afile.read()])
            afile.close()
    outfile.close()
Other questions which seemed similar to mine (for example, Python: Parsing Multiple .txt Files into a Single .csv File?, Merging multiple .txt files into a csv, and Converting 1000 text files into a single csv file) do not solve this exact problem I presented (and I could not adapt the solutions presented to my case).
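The exact error is not shown in the question, but a common pitfall with csv.writer is not passing newline='' to open, which produces blank rows on Windows. As a minimal sketch of the conversion, assuming all the .txt files sit directly in dirpath and are UTF-8 encoded:

import csv
from pathlib import Path

dirpath = Path('path_of_directory')   # folder containing the .txt files
output = 'output_file.csv'

# newline='' stops csv.writer from emitting blank rows on Windows.
with open(output, 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['File Name', 'Content'])
    for txt_file in sorted(dirpath.glob('*.txt')):
        writer.writerow([txt_file.name, txt_file.read_text(encoding='utf-8')])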
I had a similar requirement and so I wrote the following class
import os
import pathlib
import glob
import csv
from collections import defaultdict

class FileCsvExport:
    """Generate a CSV file containing the name and contents of all files found"""

    def __init__(self, directory: str, output: str, header = None, file_mask = None, walk_sub_dirs = True, remove_file_extension = True):
        self.directory = directory
        self.output = output
        self.header = header
        self.pattern = '**/*' if walk_sub_dirs else '*'
        if isinstance(file_mask, str):
            self.pattern = self.pattern + file_mask
        self.remove_file_extension = remove_file_extension
        self.rows = 0

    def export(self) -> bool:
        """Return True if the CSV was created"""
        return self.__make(self.__generate_dict())

    def __generate_dict(self) -> defaultdict:
        """Finds all files recursively based on the specified parameters and returns a defaultdict"""
        csv_data = defaultdict(list)
        for file_path in glob.glob(os.path.join(self.directory, self.pattern), recursive = True):
            path = pathlib.Path(file_path)
            if not path.is_file():
                continue
            content = self.__get_content(path)
            name = path.stem if self.remove_file_extension else path.name
            csv_data[name].append(content)
        return csv_data

    @staticmethod
    def __get_content(file_path: str) -> str:
        with open(file_path) as file_object:
            return file_object.read()

    def __make(self, csv_data: defaultdict) -> bool:
        """
        Takes a defaultdict of {k, [v]} where k is the file name and v is a list of file contents.
        Writes out these values to a CSV and returns True when complete.
        """
        with open(self.output, 'w', newline = '') as csv_file:
            writer = csv.writer(csv_file, quoting = csv.QUOTE_ALL)
            if isinstance(self.header, list):
                writer.writerow(self.header)
            for key, values in csv_data.items():
                for duplicate in values:
                    writer.writerow([key, duplicate])
                    self.rows = self.rows + 1
        return True
Which can be used like so
...
myFiles = r'path/to/files/'
outputFile = r'path/to/output.csv'
exporter = FileCsvExport(directory = myFiles, output = outputFile, header = ['File Name', 'Content'], file_mask = '.txt')
if exporter.export():
    print(f"Export complete. Total rows: {exporter.rows}.")
In my example directory, this returns
Export complete. Total rows: 6.
Note: rows does not count the header if present
This generated the following CSV file:
"File Name","Content"
"Test1","This is from Test1"
"Test2","This is from Test2"
"Test3","This is from Test3"
"Test4","This is from Test4"
"Test5","This is from Test5"
"Test5","This is in a sub-directory"
Optional parameters:
header: Takes a list of strings that will be written as the first line in the CSV. Default None.
file_mask: Takes a string that can be used to specify the file type; for example, .txt will cause it to only match .txt files. Default None.
walk_sub_dirs: If set to False, it will not search in sub-directories. Default True.
remove_file_extension: If set to False, it will cause the file name to be written with the file extension included; for example, File.txt instead of just File. Default True.

How to write output of os.walk() to a file in python 3

The Python code below will read "/home/sam" and traverse it using os.walk().
The three attributes that we get from os.walk() will be read using the "for" loop and then written to the file "Dir_traverse_date.txt".
My problem is that when the program is done executing, the only word written to the file "Dir_traverse_date.txt" is None.
How do I fix this? How do I get the output of the function into the text file?
================================CODE=====================================
import os

def dir_trav():
    os.chdir("/home/sam")
    print("Current Directory", os.getcwd())
    for dirpath, dirname, filename in os.walk(os.getcwd()):
        print("Directory Path ----> ", dirpath)
        print("Directory Name ----> ", dirname)
        print("File Name ----> ", filename)
    return

funct_out = dir_trav()
new_file = open('Dir_traverse_date.txt', 'w')
new_file.write(str(funct_out))
new_file.close()
========================================================================
In Python, return must be followed by the object you wish the function to return. You can begin by manually placing a hard-coded string in the return line, for example return "To Sender". Your file should now contain the text "To Sender" instead of "None". Try this with a few other strings or even numbers. Regardless of where you run os.walk, your output will always be the same; what matters is what you place beside return.
Your goal is to construct a string from the data gathered for you by os.walk and return it. I see that you are already printing some of the data. Let's begin fixing this by just gathering file names. Start off with an empty string and then accumulate your output with the += operator.
def dir_trav():
    os.chdir("/home/sam")
    print("Current Directory", os.getcwd())
    output = ''
    for dirpath, dirname, filenames in os.walk(os.getcwd()):
        for filename in filenames:
            output += filename
    return output
Now, you'll notice that your output will change to include filenames, but they'll all be stuck together end to end (e.g. file1file2file3). This is because we need to ensure that we insert a newline after each piece of data we are extracting.
def dir_trav():
    os.chdir("/home/sam")
    print("Current Directory", os.getcwd())
    output = ''
    for dirpath, dirname, filenames in os.walk(os.getcwd()):
        for filename in filenames:
            output += filename + '\n'
    return output
From this point you should be able to move closer to the results you were looking for. String concatenation (+) is not the most efficient method for building a string from multiple pieces of data, but it will serve your purposes.
Note: functions in Python can return multiple values, but they are technically bound up in a single object, which is essentially a tuple.
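As a sketch of the more efficient pattern hinted at above, the file names can be collected in a list and joined once at the end instead of being concatenated inside the loop:

import os

def dir_trav(start="/home/sam"):
    # Collect file names in a list, then join once at the end; this avoids
    # repeated string concatenation inside the loop.
    names = []
    for dirpath, dirnames, filenames in os.walk(start):
        names.extend(filenames)
    return '\n'.join(names)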
You didn't return anything; a function without a return value gives back None.
import os

def dir_trav():
    os.chdir("/home/sam")
    print("Current Directory: ", os.getcwd())
    data = []
    for dirpath, dirnames, filenames in os.walk(os.getcwd()):
        for name in filenames:
            filename = os.path.join(dirpath, name)
            data.append(filename)
    return data

new_file = open('Dir_traverse_date.txt', 'w')
for filename in dir_trav():
    new_file.write(filename)
    new_file.write('\n')
new_file.close()

Python changing file name

My application offers the user the ability to export its results. It exports text files named Exp_Text_1, Exp_Text_2, etc. I want it so that if a file with the same file name already exists on the Desktop, the numbering starts from that number upwards. For example, if a file named Exp_Text_3 is already on the Desktop, then I want the file to be created with the name Exp_Text_4.
This is my code:
if len(str(self.Output_Box.get("1.0", "end"))) == 1:
    self.User_Line_Text.set("Nothing to export!")
else:
    import os.path
    self.txt_file_num = self.txt_file_num + 1
    file_name = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt" + "_" + str(self.txt_file_num) + ".txt")
    file = open(file_name, "a")
    file.write(self.Output_Box.get("1.0", "end"))
    file.close()
    self.User_Line_Text.set("A text file has been exported to Desktop!")
you likely want os.path.exists:
>>> import os
>>> help(os.path.exists)
Help on function exists in module genericpath:
exists(path)
Test whether a path exists. Returns False for broken symbolic links
A very basic example would be to create a file name with a formatting mark so the number can be inserted for multiple checks:
import os

name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
# the "{}" is a formatting mark so we can do name_to_format.format(num)

num = 1
while os.path.exists(name_to_format.format(num)):
    num += 1

new_file_name = name_to_format.format(num)
This would check each filename, starting with Exp_Txt_1.txt, then Exp_Txt_2.txt, etc., until it finds one that does not exist.
However the format mark may cause a problem if curly brackets {} are part of the rest of the path, so it may be preferable to do something like this:
import os

def get_file_name(num):
    return os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_" + str(num) + ".txt")

num = 1
while os.path.exists(get_file_name(num)):
    num += 1

new_file_name = get_file_name(num)
EDIT: answer to "why don't we need the get_file_name function in the first example?"
First off, if you are unfamiliar with str.format, you may want to look at the Python docs on common string operations and/or this simple example:
text = "Hello {}, my name is {}."
x = text.format("Kotropoulos","Tadhg")
print(x)
print(text)
The path string is figured out with this line:
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
But it has {} in place of the desired number (since we don't know what the number should be at this point), so if the path was, for example:
name_to_format = "/Users/Tadhg/Desktop/Exp_Txt_{}.txt"
then we can insert a number with:
print(name_to_format.format(1))
print(name_to_format.format(2))
and this does not change name_to_format, since str objects are immutable; .format returns a new string without modifying name_to_format. However, we would run into a problem if our path was something like one of these:
name_to_format = "/Users/Bob{Cat}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Bobcat{}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Smiley{:/Desktop/Exp_Txt_{}.txt"
since the formatting mark we want to use is no longer the only set of curly brackets, and we can get a variety of errors:
KeyError: 'Cat'
IndexError: tuple index out of range
ValueError: unmatched '{' in format spec
So you only want to rely on str.format when you know it is safe to use. Hope this helps, have fun coding!
