Can't Glob Several File Types - python-3.x

I've researched and tested this issue for a while and can't seem to get it to work.
user_path
Is provided by the user and it contains .xlsm, ,xlsb and .xlsx file types. I'm trying to catch all of them and convert them to .csv. This works individually if I substitute the extensions:
all_files = glob.glob(os.path.join(user_path, "*.xlsm")) #xlsb, xlsm
I've tried the following two methods, neither of which work (win32com just tells me Excel can't access the out_folder.)
all_files = glob.glob(os.path.join(user_path, "*"))
all_files = glob.glob(user_path)
How can I send these two file types together with user_path?
Thanks in advance.

By using just *, glob matches all files AND directories under the given folder, including those you have no access to, which in your case is the out_folder directory, so when you iterate over the file names, make sure if they end with one of the file extensions you're looking for before you try to open them.
Since glob can't test for multiple file extensions at a time, it's actually better to use os.listdir and do the filtering of multiple file extensions on your own.
for filename in os.listdir(user_path):
if any(map(filename.endswith, ('.xlsm', '.xlsb', '.xlsx'))):
do_something(filename)
Or, with list comprehension,
all_files = [filename for filename in os.listdir(user_path) if any(map(filename.endswith, ('.xlsm', '.xlsb', '.xlsx')))]
Edit by the OP (actual code):
pathlib.Path(path + '\out_folder').mkdir(parents = True, exist_ok = True)
newpath = os.path.join(path,'out_folder')
#this is the line I can't seem to get to read both file types - it works as is.
all_files_test = glob.glob(os.path.join(user_path, "*.xlsm")) #xlsb, xlsm
for file in all_files_test:
name1 = os.path.splitext(os.path.split(file)[1])[0]

Related

Python filepaths have double backslashes

Ultimately, I want to loop through every pdf in specified directory ('C:\Users\dude\pdfs_for_parsing') and print the metadata for each pdf. The issue is that when I try to loop through the "directory" I'm receiving the error "FileNotFoundError: [Errno 2] No such file or directory:". I understand this error is occurring because I now have double slashes in my filepaths for some reason.
Example Code
import PyPDF2
import os
path_of_the_directory = r'C:\Users\dude\pdfs_for_parsing'
directory = []
ext = ('.pdf')
def isolate_pdfs():
for files in os.listdir(path_of_the_directory):
if files.endswith(ext):
x = os.path.abspath(files)
directory.append(x)
for pdf in directory:
reader = PyPDF2.PdfReader(pdf)
information = reader.metadata
print(information)
isolate_pdfs()
If I print the file paths one at a time, I see that the files have single '/' like I'm expecting:
for pdf in directory:
print(pdf)
The '//' seems to get added when I try to open each of the PDFs 'PDFFile = open(pdf,'rb')'
Your issue has nothing to do with //, it's here:
os.path.abspath(files)
Say you have C:\Users....\x.pdf, you list that directory, so the files will contain x.pdf. You then take the absolute path of x.pdf, which the abspath supposes to be in the current directory. You should replace it with:
x = os.path.join(path_of_the_directory, files)
Other notes:
PDFFile and PDF shouldn't be in uppercase. Prefer pdf_file and pdf_reader. The latter also avoids the confusion with the for pdf in...
Try to use a debugger rather than print statements. This is how I found your bug. It can be in your IDE or in command line with python -i You can step through your code, test a few variations, fiddle with the variables...
Why is ext = ('.pdf') with braces ? It doesn't do anything but leads to think that it might be a tuple (but isn't).
As an exercise the first for can be written as: directory = [os.path.join(path_of_the_directory, x) for x in os.listdir(path_of_the_directory) if x.endswith(ext)]

How to copy merge files of two different directories with different extensions into one directory and remove the duplicated ones

I would need a Python function which performs below action:
I have two directories which in one of them I have files with .xml format and in the other one I have files with .pdf format. To simplify things consider this example:
Directory 1: a.xml, b.xml, c.xml
Directory 2: a.pdf, c.pdf, d.pdf
Output:
Directory 3: a.xml, b.xml, c.xml, d.pdf
As you can see the priority is with the xml files in the case that both extensions have similar names.
I would be thankful for your help.
You need to use the shutil module and the os module to achieve this. This function will work on the following assumption:
A given directory has all files with the same extension
The priority_directory will be the directory with file extensions to be prioritized
The secondary_directory will be the directory with file extensions to be dropped in case of a name collision
Try:
import os,shutil
def copy_files(priority_directory,secondary_directory,destination = "new_directory"):
file_names = [os.path.splitext(filename)[0] for filename in os.listdir(priority_directory)] # get the file names to check for collisions
os.mkdir(destination) # make a new directory
for file in os.listdir(priority_directory): # this loop copies the first direcotory as it is
file_path = os.path.join(priority_directory,file)
dst_path = os.path.join(destination,file)
shutil.copy(file_path,dst_path)
for file in os.listdir(secondary_directory): # this loop checks for collisions and drops files whose name collide
if(os.path.splitext(file)[0] not in file_names):
file_path = os.path.join(secondary_directory,file)
dst_path = os.path.join(destination,file)
shutil.copy(file_path,dst_path)
print(os.listdir(destination))
Let's run it with your direcotry names as arguments:
copy_files('directory_1','directory_2','directory_3')
You can now check a new directory with the name directory_3 will be created with the desired files in it.
This will work for all such similar cases no matter what the extension is.
Note: There should not be a need to do this i guess cause a directory can have two files with the same name as long as the extensions differ.
Rough working solution:
import os
from shutil import copy2
d1 = './d1/'
d2 = './d2/'
d3 = './d3/'
ext_1 = '.xml'
ext_2 = '.pdf'
def get_files(d: str, files: list):
directory = os.fsencode(d)
for file in os.listdir(d):
dup = False
filename = os.fsdecode(file)
if filename[-4:] == ext_2:
for (x, y) in files:
if y == filename[:-4] + ext_1:
dup = True
break
if dup:
continue
files.append((d, filename))
files = []
get_files(d1, files)
get_files(d2, files)
for d, file in files:
copy2(d+file, d3)
I'll see if I can get it to look/perform better.

Create folders dynamically and write csv files to that folders

I would like to read several input files from a folder, perform some transformations,create folders on the fly and write the csv to corresponding folders. The point here is I have the input path which is like
"Input files\P1_set1\Set1_Folder_1_File_1_Hour09.csv" - for a single patient (This file contains readings of patient (P1) at 9th hour)
Similarly, there are multiple files for each patient and each patient files are grouped under each folder as shown below
So, to read each file, I am using wildcard regex as shown below in code
I have already tried using the glob package and am able to read it successfully but am facing issue while creating the output folders and saving the files. I am parsing the file string as shown below
f = "Input files\P1_set1\Set1_Folder_1_File_1_Hour09.csv"
f[12:] = "P1_set1\Set1_Folder_1_File_1_Hour09.csv"
filenames = sorted(glob.glob('Input files\P*_set1\*.csv'))
for f in filenames:
print(f) #This will print the full path
print(f[12:]) # This print the folder structure along with filename
df_transform = pd.read_csv(f)
df_transform = df_transform.drop(['Format 10','Time','Hour'],axis=1)
df_transform.to_csv("Output\" + str(f[12:]),index=False)
I expect the output folder to have the csv files which are grouped by each patient under their respective folders. The screenshot below shows how the transformed files should be arranged in output folder (same structure as input folder). Please note that "Output" folder is already existing (it's easy to create one folder you know)
So to read files in a folder use os library then you can do
import os
folder_path = "path_to_your_folder"
dir = os.listdir(folder_path)
for x in dir:
df_transform = pd.read_csv(f)
df_transform = df_transform.drop(['Format 10','Time','Hour'],axis=1)
if os.path.isdir("/home/el"):
df_transform.to_csv("Output/" + str(f[12:]),index=False)
else:
os.makedirs(folder_path+"/")
df_transform.to_csv("Output/" + str(f[12:]),index=False)
Now instead of user f[12:] split the x in for loop like
file_name = x.split('/')[-1] #if you want filename.csv
Let me know if this is what you wanted

Rename files according to list

I'm trying to rename files in a directory using a list. My code so far will only rename the first file before giving me a FileNotFoundError. How can I read the list and rename my files in the same order as it?
import os
import glob
fileLib = ('/filepath1/')
ref = ('/filepath2/ref.csv')
for file in glob.glob(os.path.join(fileLib, '*.csv')):
with open(ref) as list1:
line = list1.read().split(',\n')
for name in line:
os.rename(file, os.path.join(fileLib, '{}.csv'.format(name)))
You're applying the rename to the same file, since the loops are nested.
So the first time it works, and the next time it tries to rename a file that has been already renamed.
Reorganize your code. First, read the new names file:
fileLib = '/filepath1/'
ref = '/filepath2/ref.csv'
with open(ref) as list1:
newnames = list1.read().split(',\n')
then zip directory contents and the new names list together with a single loop:
for file,newname in zip(glob.glob(os.path.join(fileLib, '*.csv')),newnames):
os.rename(file, os.path.join(fileLib, '{}.csv'.format(newname)))
Since zip stops when one of the iterable parameters is exhausted, if the glob result is longer than the new names list, renaming will be done only partially, so it would be better to check that both lists have the same size prior to renamining.

how do i open a file in python with variable name?

I am trying to open a file that has a random time date stamp as part of its name. Most of the filename is known eg filename = 2017_01_23_624.txt is my test name. The date and numbers is the part I am trying to replace with something unknown as it will change.
My question relates to opening an existing file and not creating a new filename. The filenames are created by a separate program and I must be able to open them. I also want to change them once they are open.
I have been trying to construct the filename string as follow but get invalid syntax.
filename = "testfile" + %s + ".txt"
print(filename)
fo = open(filename)
You can't open a file with a partial filename containing a wildcard for it to match. What you would have to do it look at all the files in your directory and pick the one that matches best.
Simple example:
import os
filename = "filename2017_" # known section
direc = r"directorypath"
matches = [ fname for fname in os.listdir(direc) if fname.startswith(filename) ]
print(matches)
You can however use the glob module (Thanks to bzimor) to pattern match your files. See glob docs for more info.
import glob, os
# ? - represents any single character in the filename
# * - represents any number of characters in the filename
direc = r"directorypath"
pattern = 'filename2017-??-??-*.txt'
matches = glob.glob(os.path.join(direc, pattern))
print(matches)
Similar to first solution is that you still get back a list of filenames to choose from to then open. But with the glob module you can more accurately match your files if you so need. It all depends on how tight you want it to be.

Resources