Problem using glob: file not found after os.path.join() - python-3.x

I ran into a strange problem using glob (Python 3.10.0 on Linux).
I locate the required file with the following construct:
def get_last_file(folder, date=datetime.today().date()):
    os.chdir(folder)
    _files = glob.glob("*.csv")
    _files.sort(key=os.path.getctime)
    os.chdir(os.path.join("..", ".."))
    for _filename in _files[::-1]:
        string = str(date).split("-")
        if "".join(string) in _filename:
            return _filename
    # if cannot find the specific date, return newest file
    return _files[-1]
But when I call
os.path.join(fileDir, file)
with the resulting file name, I get a relative path that leads to:
FileNotFoundError: [Errno 2] No such file or directory: 'data/1109.csv'.
The file certainly exists, and when I try os.path.join(fileDir, '1109.csv'), the file is found.
The weirdest thing is that if I do:
filez = get_last_file(fileDir, datetime.today().date())
file = '1109.csv'
I still get "file not found" for file after os.path.join(fileDir, file).
Should I avoid using glob altogether?

I ended up with this solution:
file = ''
_mtime = 0
for root, dirs, filenames in os.walk(fileDir):
    for f in sorted(filenames):
        if f.endswith(".csv"):
            # os.path.join works whether or not fileDir ends with a separator
            path = os.path.join(fileDir, f)
            if os.path.getmtime(path) > _mtime:
                _mtime = os.path.getmtime(path)
                file = f
print(f'fails {file}')
The resulting os.path.join(fileDir, file) gives a (relative) path that works for further operations.
It also accounts for the difference between getctime and getmtime.

While not a direct solution, try looking at Python's pathlib module. It often leads to cleaner, less buggy solutions.
import os
from datetime import datetime
from pathlib import Path
def get_last_file(folder, date=None):
    folder = Path(folder)  # works for both relative and absolute paths
    if date is None:
        date = datetime.today().date()  # evaluated per call, not once at definition
    # glob inside the folder itself, so no os.chdir() is needed
    _files = sorted(folder.glob("*.csv"), key=os.path.getctime)
    stamp = "".join(str(date).split("-"))
    for _filename in reversed(_files):
        if stamp in _filename.name:
            return _filename.name
    # if the specific date cannot be found, return the newest file
    return _files[-1].name
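Then, instead of using os.path.join(), you can write path_dir / file_name, where path_dir is a Path object. The error may also come from changing the working directory inside your function: after os.chdir(folder) followed by os.chdir(os.path.join("..", "..")), the process may no longer be in the directory it started from, so a relative path such as 'data/1109.csv' resolves against a different location.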
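As a rough illustration of avoiding os.chdir() altogether, here is a minimal sketch that picks the newest CSV by globbing inside the folder. The name newest_csv is hypothetical, and ranking by modification time (rather than getctime) is an assumption about what "newest" should mean here.

```python
from pathlib import Path


def newest_csv(folder):
    """Return the most recently modified *.csv in `folder` as a Path, or None."""
    # Path.glob() yields paths that already include `folder`,
    # so the result stays valid whatever the working directory is.
    candidates = list(Path(folder).glob("*.csv"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

Because the returned Path already contains the folder, it can be passed straight to open() without a further os.path.join().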

Related

Find Files with the term "deadbolt" in it and return only first subfolder with os.walk

This script takes a search term and a path to a folder. It searches every subfolder for files whose names contain the term "deadbolt", collects them in a list, and returns that list.
So far so good, but in the end I want to delete the first-level subfolder under which the script found a deadbolt file.
For example, I have the following folder structure:
d:/Movies/
├─ Movie1/subfolder1Movie1/subfolder2Movie1/movie1.mp4.deadbolt
├─ Movie2/subfolder1Movie2/subfolder2Movie2/movie2.mpeg
├─ Movie3/subfolder1Movie3/subfolder2Movie3/movie3.avi.deadbolt
In this case I provide the path "D:\Movies" and the term "deadbolt" and want the script to return ["Movie1","Movie3"].
That is because I want to delete those folder structures completely, with their subfolders and files. But how can I get the first subfolder where a file was found without using a regex?
import os
import re
def findDeadbolts(searchTerm, search_path):
    results = []
    for root, dirs, files in os.walk(search_path, topdown=True):
        for filename in files:
            if searchTerm in filename:
                fullPath = os.path.join(root, filename)
                results.append(fullPath)
                # Don't want to do it with regex since names can be quite complex
                pattern = r"(?<=Movies\\)[a-zA-Z0-9_\-!?]+"
                print(re.search(pattern, fullPath)[0])
    return results
print(findDeadbolts('deadbolt', 'D:\\Movies'))
I found a solution for this.
I used the parts attribute of pathlib.Path, which gives every component of a path as a tuple. Since I know the root path, I can split both paths and use the number of root components as an index: because indexing starts at 0, foundParts[len(rootParts)] is the first folder below the root.
from pathlib import Path
import os
from shutil import rmtree
# raw strings, so the backslashes are not treated as escape sequences
foundPath = r"D:\Movies\Movie2\Movie2.avi.deadbolt"
rootPath = r"D:\Movies"
foundParts = Path(foundPath).parts
rootParts = Path(rootPath).parts
folder = foundParts[len(rootParts)]  # first component below the root
rmtree(os.path.join(rootPath, folder))
If there is a better solution for this, comment below. ;-)
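One possible way to combine the search and the "first subfolder" step is sketched below. It uses Path.relative_to() instead of counting parts; the function name top_level_folders_with is hypothetical, and the sketch only reports folder names and leaves the deletion to the caller.

```python
import os
from pathlib import Path


def top_level_folders_with(search_term, search_path):
    """Return the sorted names of first-level subfolders of `search_path`
    that contain, at any depth, a file whose name includes `search_term`."""
    root = Path(search_path)
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        if any(search_term in name for name in files):
            rel = Path(dirpath).relative_to(root)
            if rel.parts:  # ignore matches that sit directly in the root
                found.add(rel.parts[0])
    return sorted(found)
```

Each returned name could then be passed to shutil.rmtree(os.path.join(search_path, name)), as in the answer above.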

Python filepaths have double backslashes

Ultimately, I want to loop through every PDF in a specified directory ('C:\Users\dude\pdfs_for_parsing') and print the metadata for each one. The issue is that when I try to loop through the "directory" list, I receive the error "FileNotFoundError: [Errno 2] No such file or directory:". I believe this error occurs because I now have double slashes in my file paths for some reason.
Example Code
import PyPDF2
import os
path_of_the_directory = r'C:\Users\dude\pdfs_for_parsing'
directory = []
ext = ('.pdf')
def isolate_pdfs():
    for files in os.listdir(path_of_the_directory):
        if files.endswith(ext):
            x = os.path.abspath(files)
            directory.append(x)
    for pdf in directory:
        reader = PyPDF2.PdfReader(pdf)
        information = reader.metadata
        print(information)
isolate_pdfs()
If I print the file paths one at a time, I see that they have a single '/', as I expect:
for pdf in directory:
    print(pdf)
The '//' seems to get added when I try to open each of the PDFs with PDFFile = open(pdf, 'rb').
Your issue has nothing to do with //, it's here:
os.path.abspath(files)
Say you have C:\Users....\x.pdf. You list that directory, so files will contain just x.pdf. You then take the absolute path of x.pdf, and os.path.abspath assumes it is in the current working directory, not in the directory you listed. You should replace it with:
x = os.path.join(path_of_the_directory, files)
Other notes:
PDFFile and PDF shouldn't be in uppercase. Prefer pdf_file and pdf_reader. The latter also avoids confusion with the loop variable in for pdf in...
Try to use a debugger rather than print statements. This is how I found your bug. It can be the one in your IDE, or on the command line with python -i. You can step through your code, test a few variations, fiddle with the variables...
Why is ext = ('.pdf') written with parentheses? They don't do anything, but they suggest that it might be a tuple (it isn't).
As an exercise, the first for loop can be written as: directory = [os.path.join(path_of_the_directory, x) for x in os.listdir(path_of_the_directory) if x.endswith(ext)]
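To make the abspath point concrete, here is a small sketch of the listing step on its own. The helper name list_pdfs is hypothetical; it only builds the paths and leaves the PyPDF2 part out.

```python
import os


def list_pdfs(directory, ext=".pdf"):
    """Return full paths of the files in `directory` whose names end with `ext`."""
    # os.listdir() yields bare names, so each one is joined with the
    # directory it came from instead of being resolved against the
    # current working directory (which is what os.path.abspath does).
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(ext)
    )
```

Each returned path can be handed to PyPDF2.PdfReader() or open() regardless of where the script was started from.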

How to copy merge files of two different directories with different extensions into one directory and remove the duplicated ones

I need a Python function that performs the following action:
I have two directories: one contains files in .xml format and the other contains files in .pdf format. To simplify things, consider this example:
Directory 1: a.xml, b.xml, c.xml
Directory 2: a.pdf, c.pdf, d.pdf
Output:
Directory 3: a.xml, b.xml, c.xml, d.pdf
As you can see, the XML files take priority when files in both directories share the same base name.
I would be thankful for your help.
You need to use the shutil module and the os module to achieve this. This function works under the following assumptions:
A given directory has all files with the same extension
The priority_directory will be the directory with file extensions to be prioritized
The secondary_directory will be the directory with file extensions to be dropped in case of a name collision
Try:
import os, shutil
def copy_files(priority_directory, secondary_directory, destination="new_directory"):
    # base names (without extension) in the priority directory, used to detect collisions
    file_names = [os.path.splitext(filename)[0] for filename in os.listdir(priority_directory)]
    os.makedirs(destination, exist_ok=True)  # create the new directory if it does not exist yet
    for file in os.listdir(priority_directory):  # this loop copies the first directory as it is
        file_path = os.path.join(priority_directory, file)
        dst_path = os.path.join(destination, file)
        shutil.copy(file_path, dst_path)
    for file in os.listdir(secondary_directory):  # this loop skips files whose base name collides
        if os.path.splitext(file)[0] not in file_names:
            file_path = os.path.join(secondary_directory, file)
            dst_path = os.path.join(destination, file)
            shutil.copy(file_path, dst_path)
    print(os.listdir(destination))
Let's run it with your directory names as arguments:
copy_files('directory_1','directory_2','directory_3')
A new directory named directory_3 will be created with the desired files in it.
This will work for all similar cases, no matter what the extensions are.
Note: I guess there should be no need to do this, because a directory can hold two files with the same name as long as the extensions differ.
Rough working solution:
import os
from shutil import copy2
d1 = './d1/'
d2 = './d2/'
d3 = './d3/'
ext_1 = '.xml'
ext_2 = '.pdf'
def get_files(d: str, files: list):
    directory = os.fsencode(d)
    for file in os.listdir(d):
        dup = False
        filename = os.fsdecode(file)
        if filename[-4:] == ext_2:
            for (x, y) in files:
                if y == filename[:-4] + ext_1:
                    dup = True
                    break
        if dup:
            continue
        files.append((d, filename))
files = []
get_files(d1, files)
get_files(d2, files)
for d, file in files:
    copy2(d+file, d3)
I'll see if I can get it to look/perform better.
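For comparison, a more compact variant of the same idea could be sketched with pathlib, comparing file stems instead of slicing extensions by hand. The function name merge_directories is hypothetical, and the sketch assumes both source directories are flat (only regular files are copied).

```python
import shutil
from pathlib import Path


def merge_directories(priority_dir, secondary_dir, destination):
    """Copy every file from `priority_dir`, then only those files from
    `secondary_dir` whose stem (name without extension) is not taken yet."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    taken = set()
    for src in sorted(Path(priority_dir).iterdir()):
        if src.is_file():
            shutil.copy2(src, destination / src.name)
            taken.add(src.stem)
    for src in sorted(Path(secondary_dir).iterdir()):
        if src.is_file() and src.stem not in taken:
            shutil.copy2(src, destination / src.name)
    return sorted(p.name for p in destination.iterdir())
```

Called as merge_directories('d1', 'd2', 'd3'), it returns the names that ended up in the destination, which makes the result easy to check.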

FileNotFoundError long file path python - filepath longer than 255 characters

Normally I don't ask questions, because I find answers on this forum. This place is a goldmine.
I am trying to move some files from a legacy storage system (a CIFS share) to Box using the Python SDK. It works fine as long as the file path is shorter than 255 characters.
I am using os.walk, passing the share name in Unix format, to list the files in the directory.
Here is an example file path:
//dalnsphnas1.mydomain.com/c$/fs/hdrive/home/abcvodopivec/ENV Resources/New Regulation Review/Regulation Reviews and Comment Letters/Stormwater General Permits/CT S.W. Gen Permit/PRMT0012_FLPR Comment Letter on Proposed Stormwater Regulations - 06-30-2009.pdf
I also tried to escape the spaces in the file name, but I still get FileNotFoundError even though the file is there.
//dalnsphnas1.mydomain.com/c$/fs/hdrive/home/abcvodopivec/ENV Resources/New Regulation Review/Regulation Reviews and Comment Letters/Stormwater General Permits/CT S.W. Gen Permit/PRMT0012_FLPR\ Comment\ Letter\ on\ Proposed\ Stormwater\ Regulations\ -\ 06-30-2009.pdf
So I tried to shorten the path using win32api.GetShortPathName, but it throws the same FileNotFoundError. It works fine on files whose path is shorter than 255 characters.
I also tried to copy the file to another destination folder using copyfile(src, dst) to get around this issue, and I still get the same error.
import os, sys
import argparse
import win32api
import win32con
import win32security
from os import walk
from shutil import copyfile  # needed for copyfile() below
parser = argparse.ArgumentParser(
    description='Migration Script',
)
parser.add_argument('-p', '--home_path', required = True, help='Home Drive Path')
args = vars(parser.parse_args())
if args['home_path']:
    pass
else:
    print("Usage : script.py -p <path>")
    print("-p <directory path>/")
    sys.exit()
dst = (args['home_path'] + '/' + 'long_file_path_dir')
for dirname, dirnames, filenames in os.walk(args['home_path']):
    for filename in filenames:
        file_path = (dirname + '/' + filename)
        path_len = len(file_path)
        if(path_len > 255):
            #short_path = win32api.GetShortPathName(file_path)
            copyfile(file_path, dst, follow_symlinks=True)
After a lot of trial and error, I figured out the solution (thanks to the Stack Overflow forum):
I switched from the Unix format to a UNC path.
Then I prefixed each file path generated through os.walk with r'\\?\UNC', as shown below. A UNC path starts with two backslashes, so I have to remove one of them to make it work:
file_path = (r'\\?\UNC' + file_path[1:])
Thanks again to everyone who responded.
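The prefixing step can be sketched as a small helper. The name to_extended_length is hypothetical, and the sketch assumes the input is already an absolute path; it is pure string handling, so forward slashes are converted explicitly because Windows does not normalize paths that carry the \\?\ prefix.

```python
def to_extended_length(path: str) -> str:
    """Return `path` with the Windows extended-length prefix added."""
    path = path.replace("/", "\\")
    if path.startswith("\\\\?\\"):
        return path  # already prefixed
    if path.startswith("\\\\"):
        # UNC path: \\server\share\... becomes \\?\UNC\server\share\...
        return "\\\\?\\UNC" + path[1:]
    # drive path: C:\dir\file becomes \\?\C:\dir\file
    return "\\\\?\\" + path
```

Inside the os.walk loop, file_path = to_extended_length(dirname + '/' + filename) would then replace the manual concatenation.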
Shynee

Can you not assign a variable to a fnmatch function in python?

I have the following code. It prints out the file name, but it doesn't assign it to the variable file in a way that lets me open it:
for file in os.listdir('C:\\Users\\####\\Documents\\Visual Studio 2015\\Projects\\Data\\'):
    if fnmatch.fnmatch(file, '*.csv'):
        scanReport = open(file)
        scanReader = csv.reader(scanReport)
fnmatch doesn't (and cannot) expand file into the proper path. It's just a wildcard pattern test.
os.listdir returns file names, not file paths. Match the file name (as you already do), but pass the full path to open, built with os.path.join from your source directory:
the_dir = r'C:\Users\####\Documents\Visual Studio 2015\Projects\Data'
for file in os.listdir(the_dir):
    if fnmatch.fnmatch(file, '*.csv'):
        scanReport = open(os.path.join(the_dir, file))
Or maybe it's better to use glob.glob in that case, to filter and get the full path at the same time:
import glob
for file in glob.glob(r'C:\Users\####\Documents\Visual Studio 2015\Projects\Data\*.csv'):
    scanReport = open(file)
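Putting the first variant together with the csv reader from the question, a minimal sketch might look like this. The function name read_csv_rows is hypothetical, and collecting the rows in a dict keyed by file name is only one possible choice.

```python
import csv
import fnmatch
import os


def read_csv_rows(directory):
    """Return {file name: list of rows} for every *.csv file in `directory`."""
    rows = {}
    for name in os.listdir(directory):
        if fnmatch.fnmatch(name, "*.csv"):
            # open() needs the full path; the bare name only works
            # when `directory` happens to be the working directory
            full_path = os.path.join(directory, name)
            with open(full_path, newline="") as handle:
                rows[name] = list(csv.reader(handle))
    return rows
```

The with block also closes each file once it has been read, which the original loop did not do.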

Resources